Doing sociophonetics with LAP data

Josef Fruehwald

Introduction

  • Quick intro to the LAP data

  • Overview of the contemporary sociophonetics workflow

  • The unique issues posed by the LAP data

  • Initial approach to addressing the issues

LAP Data

Linguistic Atlas of the North-Central States

LANCS data

Kentucky Data

Using Automatic Speech Recognition on the LAP

A Typical Sociophonetics workflow

Time intensiveness

This first portion of the diagram is the most time-intensive part of the process after fieldwork is over and before analysis begins.

The best-case scenario is 10 hours of transcription time for every 1 hour of audio.

Time intensiveness

  • LANCS audio: ~177 hours

  • Total transcription time: 1,770 to 2,700 hours

Time to Transcription

(1 RA @ 15 hr/wk)

2.5 to 3.5 years

Cost of Transcription

(@ $15/hr)

$26,550 to $40,500
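Taking the 1,770–2,700 hour transcription range as given, the cost and time figures can be reproduced with some quick arithmetic (the year figures on these slides are rounded a little more generously, presumably to budget for breaks and turnover):

```python
# reproduce the time/cost estimates from the transcription-hour range,
# assuming $15/hr and one RA working 15 hours per week
RATE = 15            # dollars per hour of transcription work
HOURS_PER_WEEK = 15  # one RA at 15 hr/wk
WEEKS_PER_YEAR = 52

for hours in (1770, 2700):
    cost = hours * RATE
    years = hours / HOURS_PER_WEEK / WEEKS_PER_YEAR
    print(f"{hours} hrs -> ${cost:,} over about {years:.1f} years")
# 1770 hrs -> $26,550 over about 2.3 years
# 2700 hrs -> $40,500 over about 3.5 years
```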

My Original Plan for LAP data

Replace this

My Original Plan for LAP data

With this

wav2vec fine-tuning

Initial experiments fine-tuning a pretrained wav2vec 2.0 model (Baevski et al. n.d.) on 3.5 hours of PNC data resulted in:

  • eval word error rate = 0.34

  • eval character error rate = 0.189
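For reference, both metrics are the Levenshtein (edit) distance between the ASR hypothesis and the reference transcript, normalized by the reference length, computed over words (WER) or characters (CER). A minimal self-contained sketch of the computation (an illustration, not the evaluation code used for the model above):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edit distance / reference length."""
    return edit_distance(ref, hyp) / len(ref)
```

So an eval WER of 0.34 means roughly one word in three needed an insertion, deletion, or substitution to match the reference.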

However

Audio

All ASR systems are trained on labelled audio. When the properties of the training audio and the use-case audio are very different, the system may not perform well, as the following examples illustrate.

Audio

An example of training audio

which circumstances do not permit him to employ

(source: LibriSpeech (Panayotov et al. 2015))

An example of LANCS audio

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday

Using Automatic Speech Recognition on the LAP

Pre-processing LAP Audio

Consistent issues

To the extent there are consistent issues across LANCS data, we can develop pre-processing workflows for them.

Issue 1: 60 Hz mains hum (and harmonics)

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide

Issue 2: Microphone Hits

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide

Issue 3: Low signal to noise ratio

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday

Meta Issue: Different Recording sessions on one tape

Meta Issue: The available metadata

Audio metadata

The Processing

Open Tools

Most of the processing I’m showing here was done with the librosa library (McFee et al. 2022) in Python.

Step 1: Session separation

Time stamps for the separate sessions were recorded in a YAML file:

KY1A__1B_02_b:
  - KYUNK1:
      part: 2
      start: 0.00
      end: 1650.0
  - KY1B:
      part: 2
      start: 1650.0
      end: 5745.75
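Given that layout, the YAML can be turned into sample index ranges for slicing audio loaded at a matching sample rate. A sketch assuming PyYAML; the seconds-to-samples conversion is the only real logic, and actual splitting would then slice the array returned by librosa.load:

```python
import yaml

SESSIONS_YAML = """
KY1A__1B_02_b:
  - KYUNK1:
      part: 2
      start: 0.00
      end: 1650.0
  - KY1B:
      part: 2
      start: 1650.0
      end: 5745.75
"""

def session_sample_ranges(yaml_text, sr=16000):
    """Turn per-session start/end times (in seconds) into sample index
    ranges, keyed by (tape id, speaker id)."""
    ranges = {}
    for tape_id, sessions in yaml.safe_load(yaml_text).items():
        for session in sessions:
            for speaker, info in session.items():
                ranges[(tape_id, speaker)] = (int(info["start"] * sr),
                                              int(info["end"] * sr))
    return ranges

ranges = session_sample_ranges(SESSIONS_YAML)
# each session is then just y[start:end] of the full-tape signal
```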

Step 2: Addressing the Buzz

To deal with the low frequency hum, preemphasis and a high-pass filter were applied to the audio.

# necessary imports
import librosa
import scipy.signal

# loading the audio
y_fireplace, sr = librosa.load("assets/fireplace_short.wav", sr = 16000)

# default librosa preemphasis
y_fireplace2 = librosa.effects.preemphasis(y_fireplace)

# getting parameters for the highpass filter
b, a = scipy.signal.butter(N = 1,              # a fairly gradual slope
                           Wn = 180,           # critical frequency at 60*3
                           btype = "highpass", # highpass filter
                           fs = 16000,         # sampling rate
                           output = "ba")      # kind of output

# the actual filtering
y_fireplace2 = scipy.signal.filtfilt(b = b, a = a, x = y_fireplace2)
code as a function
def highpass(y, sr = 16000, order = 1, critical_freq = 180):
  """
  return a highpass-filtered version of the signal y
  """
  
  b, a = scipy.signal.butter(N = order, 
                             Wn = critical_freq,
                             btype = "highpass",
                             fs = sr,
                             output = "ba")
  out_signal = scipy.signal.filtfilt(b = b, a = a, x = y)
  return out_signal

Step 2: Addressing the Buzz

Before:

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide

After:

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide

Addressing the mic hits

The librosa package has implemented a method for decomposing audio into its percussive vs. harmonic components, intended to separate drum tracks from melodies (FitzGerald 2010; Driedger 2014).

Addressing the mic hits

# short-time fourier transform
D = librosa.stft(y_fireplace2, n_fft = 2048, win_length = 512, hop_length = 512//4)

# decomposition into harmonic and percussive
D_harm, D_perc = librosa.decompose.hpss(D, margin = 3)

# capture residual component
D_resid = D - (D_harm + D_perc)

# separating magnitude from phase
D_perc_m, D_perc_p = librosa.magphase(D_perc)

# converting to dB
D_perc_db = librosa.amplitude_to_db(D_perc_m)

# just subtracting 20 dB from percussive
D_perc_db = D_perc_db - 20

# back to amplitude
D_perc_new_m = librosa.db_to_amplitude(D_perc_db)

# recombining with phase
D_perc_new = D_perc_new_m * D_perc_p

# adding it all together
new_D = D_harm + D_perc_new + D_resid

# back to signal
y_fireplace3 = librosa.istft(new_D, n_fft = 2048, win_length = 512, hop_length = 512//4)

# re-normalize the output
y_fireplace3 = librosa.util.normalize(y_fireplace3)
code as a function
def dampen_hit(y, 
               sr = 16000, 
               n_fft = 2048,
               win_length = 512,
               hop_length = 512//4,
               margin = 3, 
               by_db = 20):
    """
    using harmonic/percussive decomposition, dampen mic hits
    """
    D = librosa.stft(y, n_fft = n_fft, win_length = win_length, hop_length = hop_length)
    
    # decomposition into harmonic and percussive
    D_harm, D_perc = librosa.decompose.hpss(D, margin = margin)
  
    # capture residual component
    D_resid = D - (D_harm + D_perc)
    
    # separating magnitude from phase
    D_perc_m, D_perc_p = librosa.magphase(D_perc)
    
    # converting to dB
    D_perc_db = librosa.amplitude_to_db(D_perc_m)
    
    # subtracting by_db from percussive
    D_perc_db = D_perc_db - by_db
    
    # back to amplitude
    D_perc_new_m = librosa.db_to_amplitude(D_perc_db)
    
    # recombining with phase
    D_perc_new = D_perc_new_m * D_perc_p
    
    # adding it all together
    new_D = D_harm + D_perc_new + D_resid
    
    # back to signal
    out_signal = librosa.istft(new_D, n_fft = n_fft, win_length = win_length, hop_length = hop_length)
    
    # re-normalize the output
    out_signal = librosa.util.normalize(out_signal)
    
    return out_signal

Addressing mic hits

Before

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide

After

A: would you describe the fireplace please
B: well it was just a, I guess about three foot wide

Noise Reduction

There are a few methods out there for noise reduction, including “Per-Channel Energy Normalization” (Wang et al. n.d.; Lostanlen et al. 2019) and “Spectral Gating” (Sainburg & Thielk & Gentner 2020; Sainburg et al. 2022). I’ve found spectral gating to produce better results for the final audio.

Spectral gating

You start off with a spectrogram

Spectral Gating

Then, you smear it out across the time domain. Since I’m dealing with fairly stable, consistent noise, I’ve chosen a long smearing window (3 seconds).

Spectral Gating

See how much the signal is above the background…

…then convert that into a multiplier between 0 and 1

Spectral Gating

Soften the edges a bit more…

…then multiply it by the original spectrogram
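The steps above can be sketched with NumPy and SciPy. This is a simplified illustration of spectral gating, not the noisereduce implementation; the threshold and slope values here are placeholders:

```python
import numpy as np
import scipy.ndimage
import scipy.special

def spectral_gate_mask(S_db, sr=16000, hop_length=512 // 4,
                       time_constant_s=3.0, thresh_db=3.0, slope=3.5):
    """Compute a 0-1 gate mask for a dB-scaled spectrogram S_db
    (shape: frequency bins x time frames)."""
    # smear across the time domain: a long moving average per frequency
    # bin estimates the stable background noise level
    frames = int(time_constant_s * sr / hop_length)
    background = scipy.ndimage.uniform_filter1d(S_db, size=frames, axis=1)

    # see how much above background (plus a threshold) the signal is...
    above = S_db - (background + thresh_db)

    # ...and convert that into a multiplier between 0 and 1
    mask = scipy.special.expit(slope * above)

    # soften the edges a bit more with 2d smoothing
    mask = scipy.ndimage.gaussian_filter(mask, sigma=(1, 2))

    # the caller multiplies this by the original spectrogram,
    # e.g. S_gated = librosa.db_to_amplitude(S_db) * mask
    return mask
```

The resulting 0–1 mask is multiplied into the original STFT before inverting back to a waveform, so quiet (noise-dominated) cells are attenuated and loud (speech-dominated) cells pass through.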

Noise Reduction

before

after

The end

The Very Start

The end

A more challenging case

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday

A more challenging case

import librosa
from noisereduce import reduce_noise

y_week, sr = librosa.load("assets/weekday_short.wav", sr = 16000)
y_week2 = librosa.effects.preemphasis(y_week)
y_week3 = highpass(y = y_week2)
y_week4 = librosa.util.normalize(dampen_hit(y_week3))
y_week5 = librosa.effects.deemphasis(y_week4)
y_week6 = reduce_noise(y_week5, 
                       sr=16000, 
                       n_fft=2048,
                       win_length = 512,
                       hop_length=512//4,
                       time_constant_s=3,
                       thresh_n_mult_nonstationary=3,
                       sigmoid_slope_nonstationary=3.5,
                       freq_mask_smooth_hz=500,
                       time_mask_smooth_ms=100)
y_week7 = librosa.util.normalize(y_week6)

Result

well, uh, you mean monday, tuesday, wednesday, thursday, friday, saturday

Doing this yourself

  1. Install conda
  2. Download the conda environment I’m using
  3. run conda env create -f audioprocess.yml to install all the dependencies
  4. run conda activate audioprocess (the environment name, not the .yml file)
  5. Experiment with the code from the slides or the source file.

Moving Forward

Different preprocessing for different purposes?

Speeding up human labeling with automation

It might be possible to speed up the diarization process by

  1. Running automated voice activity detection on the processed audio
  2. Exporting the identified areas of voice activity to an ELAN file with a “constrained vocabulary” for diarization.
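Assuming the VAD step yields (start, end) segments in seconds, step 2 could write them to a tab-delimited file that ELAN can import, leaving empty annotation slots for the human diarizer to fill in. The tier name and column layout here are assumptions to be matched against ELAN’s import dialog:

```python
import csv

def segments_to_elan_tsv(segments, out_path, tier="diarization"):
    """Write voice-activity segments (start, end, in seconds) to a
    tab-delimited file for import into ELAN, with empty annotation
    values left for the human diarizer."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for start, end in segments:
            # tier name, start time, end time, empty annotation to label
            writer.writerow([tier, f"{start:.3f}", f"{end:.3f}", ""])

# hypothetical VAD output for illustration
segments = [(0.00, 2.35), (3.10, 7.82)]
segments_to_elan_tsv(segments, "vad_segments.txt")
```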

Preliminary results

IVR: I have a number of uh things I'd like to ask you about, I wonder if you wouldn't mind answering questions one after another

KY25A: Yeah well, you might start that I was born in 1867

References

Baevski, Alexei & Zhou, Henry & Mohamed, Abdelrahman & Auli, Michael. n.d. wav2vec 2.0: A framework for self-supervised learning of speech representations.
Bredin, Hervé & Laurent, Antoine. n.d. End-to-end speaker segmentation for overlap-aware resegmentation. DOI: https://doi.org/10.48550/arXiv.2104.04045
Driedger, Jonathan. 2014. Extending harmonic-percussive separation of audio signals.
FitzGerald, Derry. 2010. Harmonic/percussive separation using median filtering.
Lostanlen, Vincent & Salamon, Justin & Cartwright, Mark & McFee, Brian & Farnsworth, Andrew & Kelling, Steve & Bello, Juan Pablo. 2019. Per-Channel Energy Normalization: Why and How. IEEE Signal Processing Letters 26(1). 39–43. DOI: https://doi.org/10.1109/LSP.2018.2878620
McFee, Brian & Metsai, Alexandros & McVicar, Matt & Balke, Stefan & Thomé, Carl & Raffel, Colin & … Kim, Taewoon. 2022. Librosa/librosa: 0.9.2. Zenodo. DOI: https://doi.org/10.5281/zenodo.6759664
Panayotov, Vassil & Chen, Guoguo & Povey, Daniel & Khudanpur, Sanjeev. 2015. Librispeech: An ASR corpus based on public domain audio books. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 5206–5210. DOI: https://doi.org/10.1109/ICASSP.2015.7178964
Sainburg, Tim & Saghiran, Ali & Amr, Kareem & fardage & rjfarber. 2022. Timsainb/noisereduce: v2.0.1. Zenodo. DOI: https://doi.org/10.5281/zenodo.6547070
Sainburg, Tim & Thielk, Marvin & Gentner, Timothy Q. 2020. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires. PLOS Computational Biology 16(10). e1008228. DOI: https://doi.org/10.1371/journal.pcbi.1008228
Tatman, Rachael & Kasten, Conner. 2017. Effects of talker dialect, gender & race on accuracy of Bing Speech and YouTube automatic captions. Interspeech 2017. 934–938. ISCA. DOI: https://doi.org/10.21437/Interspeech.2017-1746
Wang, Yuxuan & Getreuer, Pascal & Hughes, Thad & Lyon, Richard F. & Saurous, Rif A. n.d. Trainable frontend for robust and far-field keyword spotting.
Wassink, Alicia Beckford & Gansen, Cady & Bartholomew, Isabel. 2022. Uneven success: automatic speech recognition and ethnicity-related dialects. Speech Communication 140. 50–70. DOI: https://doi.org/10.1016/j.specom.2022.03.009